========================================================
## [1] "/Users/craig/Downloads/Udacity/White Wine Quality Project"
## [1] "L6 Project White Wine Analysis.Rmd"
## [2] "L6_Project_White_Wine_Analysis.html"
## [3] "L6_Project_White_Wine_Analysis.Rmd"
## [4] "wineQualityInfo.txt"
## [5] "wineQualityWhites.csv"
The white wine dataset explored in this report was created using variants of the Portuguese “Vinho Verde” wine. This dataset consists of 4898 observations (wine samples) and 13 variables, including the ordinal variable quality. The input variables for samples include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
To help guide this exploration, we’ll first output the structure of the dataset and its summary statistics.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The results show our dataset consists of 4898 observations and 13 total variables.
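As a minimal sketch of the loading and inspection step, the snippet below rebuilds the first three rows shown in the `str()` output and drops the row-index column `X`. The report presumably reads `wineQualityWhites.csv` directly; the inline text and the name `w_w_sub` are assumptions made here so the example runs without the file (the name `w_w` is taken from the grouped summaries later in the report).

```r
# First three rows, copied from the str() output above.
# Assumption: inline text stands in for wineQualityWhites.csv.
csv_text <- paste(
  "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality",
  "1,7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6",
  "2,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6",
  "3,8.1,0.28,0.4,6.9,0.05,30,97,0.995,3.26,0.44,10.1,6",
  sep = "\n"
)

w_w <- read.csv(text = csv_text)

# Structure and summary statistics, as in the report.
str(w_w)
summary(w_w)

# X is only a row index, so it is dropped before any analysis.
# w_w_sub is a hypothetical name for the 12-variable subset.
w_w_sub <- subset(w_w, select = -X)
names(w_w_sub)
```

With the full file, the same calls would report 4898 observations; only the three-row stand-in differs.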
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
The summary statistics (including the quartiles that bound the IQR, the minimum and maximum values, the median, and the mean) provide more detail for each variable and will help guide this exploration as we investigate single variables as well as their relationships. They will also be compared against the univariate analysis plots.
Looking at the distribution of data for quality, a rating of 6 is the most frequent rating, followed by 5 and 7 respectively. So which factors help determine wine quality? Are there specific traits that make for a better quality white wine?
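To make the IQR concrete, the quartiles printed by `summary()` can be turned into interquartile ranges by hand. The sketch below uses three variables as examples; the upper fence uses the conventional 1.5 × IQR rule, which is an assumption added here for illustration and not something the report states it applied.

```r
# First and third quartiles copied from the summary() output above.
quartiles <- data.frame(
  variable = c("residual.sugar", "sulphates", "alcohol"),
  q1 = c(1.70, 0.41, 9.50),
  q3 = c(9.90, 0.55, 11.40)
)

# The IQR is the width of the middle 50% of the data.
quartiles$iqr <- quartiles$q3 - quartiles$q1

# Conventional 1.5 * IQR upper fence (assumption: illustrative only).
quartiles$upper_fence <- quartiles$q3 + 1.5 * quartiles$iqr

print(quartiles)
```

Comparing the fence with the reported maximum (for example 65.8 for residual sugar) gives a quick sense of how long the right tail is before any plot is drawn.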
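The ranking of ratings can be checked against the frequency table printed in the bivariate section of this report. The following sketch copies those counts into a named vector (`quality_counts` is a name chosen here) and converts them to percentage shares.

```r
# Counts per quality rating, copied from the frequency table in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total <- sum(quality_counts)

# Percentage share of each rating, rounded to one decimal place.
share <- round(100 * quality_counts / total, 1)
print(share)

# Ratings ordered from most to least frequent.
most_frequent <- names(sort(quality_counts, decreasing = TRUE))
print(most_frequent[1:3])
```

Ratings 5 to 7 together cover the large majority of samples, so the extreme ratings rest on very few wines.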
## Warning: Removed 7 rows containing non-finite values (stat_bin).
The distribution of residual sugar appears to be skewed right, with most wines containing less than 10 grams per liter. How does residual sugar content vary among quality ratings? Can we expect a range or a specific sweet spot for higher quality wines?
From the results of the plot we should notice the relatively narrow IQR for sulphates, which spans 0.41-0.55 g/dm^3. It doesn’t seem as though we are likely to see much variation in sulphates across the wine samples, but we will continue our exploration of trends and observations.
We can see from this histogram for pH data a fairly symmetric distribution with a mean of 3.188, acidic but not quite as acidic as vinegar.
The distribution for alcohol appears skewed right with most wines in the range of 9.5-11.4 percent alcohol content. Is this range a factor that is reflected in quality ratings?
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
Most wines have a density of about 0.9921, with a mean of 0.9940 and a median of 0.9937. Given the amounts of alcohol and sugar in a particular wine, its density should be close to the density of water. A transformation of the long-tailed data was performed to better understand the distribution for density.
## Warning: Removed 16 rows containing non-finite values (stat_bin).
Citric acid has an IQR in the range of 0.27-0.39. Considering that citric acid can add “freshness” and flavor to wines, would it be correct to assume that this is added to lower-quality wines to improve the rating?
## Warning: Removed 4 rows containing non-finite values (stat_bin).
To understand the distribution of data for fixed acidity, we plotted its histogram; the results show that the IQR for fixed acidity lies within the range 6.3-7.3.
## Warning: Removed 2 rows containing missing values (geom_bar).
A transformation of the long-tailed data was performed to better understand the distribution of volatile acidity, and the result is a relatively normal distribution. The transformed distribution appears symmetric, with volatile acidity peaking at around 0.27 g/dm^3.
## Warning: Removed 2 rows containing missing values (geom_bar).
A transformation of the long-tailed data was performed to better understand the distribution of chlorides, and the result is a relatively normal distribution. The transformed distribution for chlorides appears symmetric, with chlorides peaking at around 0.046 g/dm^3.
Sulfur dioxide is used as a preservative in wine, and the histogram for total sulfur dioxide above shows a relatively normal distribution with a mean of 138.4 mg/dm^3. At concentrations above 50 ppm sulfur dioxide becomes detectable in the wine; perhaps this will show up when we compare against quality ratings.
## Warning: Removed 41 rows containing non-finite values (stat_bin).
The histogram for free sulfur dioxide shows a shape that appears to be skewed right. The mean value for free sulfur dioxide is 35.31 mg/dm^3, and the third quartile is 46 mg/dm^3.
There are 4,898 white wines sampled in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality).
Other observations: Most white wines feature a quality rating of 6. The median alcohol content is 10.40 percent (the mean is 10.51). Most wines have a density of about 0.9921, with a mean of 0.9940 and a median of 0.9937. About 75% of wines have a residual sugar content below 10 g/dm^3. Citric acid has an IQR in the range of 0.27-0.39 g/dm^3.
The main point of interest in the dataset will be an attempt to understand how the individual variables relate to the discrete quality variable. For this part of the analysis I thought histograms would best provide the detail needed to start to understand our data. We created plots for each variable (excluding variable X). From the results of the histogram plots we can explore which balance of features is best for determining the discrete quality of a wine. I suspect there will be some common profiles for each quality rating, although determining a wine’s quality depends on the personal preferences of a wine steward.
Balance in taste is a key component in considering the quality of wine and individual preference in this matter is subjective. In our dataset balance will consider residual sugar, chlorides, volatile acidity, citric acid, and alcohol. I am interested in exploring whether there is any relationship among these individual variables and combinations of variables to better discern trends among the ordinal quality variable. If able to distinguish trends, it is possible that we could develop a predictive model for white wine quality.
Because free sulfur dioxide is used to prevent microbial growth and oxidation and sulphates to act as an antimicrobial and antioxidant of wine, perhaps we will be able to glean some interesting insights by comparing these variables against the quality variable later in the multivariate section.
No new variables were created from existing variables in the dataset.
To obtain a clearer view of the distributions of density, volatile acidity, and chlorides, a log10 transformation was used to compress the long right tails of the data. A bit further down, in the bivariate analysis section, the cut function is used to convert the quality ratings into a discrete factor for better comparison with other variables in the bivariate and multivariate analysis sections.
We will open the bivariate analysis section by cutting quality into a new discrete variable whose levels carry over the current rating scores. Then we will take a high-level view of correlation among combinations of variables, which is achieved by subsetting the dataset and removing variable “X”.
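A small sketch shows why a log10 scale helps with a long right tail. It uses only the minimum, median, and maximum of chlorides from the summary output; the ratio of "distance above the median" to "distance below the median" is an ad hoc measure chosen here for illustration, not a statistic used in the report.

```r
# Minimum, median, and maximum of chlorides, copied from summary().
chlorides <- c(min = 0.009, median = 0.043, max = 0.346)

# On the raw scale the upper tail is far longer than the lower one.
raw_ratio <- unname((chlorides["max"] - chlorides["median"]) /
                    (chlorides["median"] - chlorides["min"]))

# After log10 the two sides of the median are much closer in length.
log_chlorides <- log10(chlorides)
log_ratio <- unname((log_chlorides["max"] - log_chlorides["median"]) /
                    (log_chlorides["median"] - log_chlorides["min"]))

cat("raw scale ratio:", round(raw_ratio, 2), "\n")
cat("log10 scale ratio:", round(log_ratio, 2), "\n")
```

Since the plots were drawn with ggplot2, the same effect on an axis would usually come from `scale_x_log10()`; whether the report used that or transformed the column directly is not shown in the output.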
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
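The interval labels that appear in the grouped summaries later in the report, such as `(2,3]` and `(8,9]`, are what `cut()` produces with unit-width breaks. The sketch below rebuilds the quality column from the frequency table above and bins it; the `breaks = 2:9` argument is an assumption inferred from those labels, and `quality_ord` is an alternative added here for comparison.

```r
# Rebuild the quality column from the frequency table above.
counts  <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(3:9, times = counts)

# Right-closed unit intervals: a score of 3 falls in (2,3], and so on.
quality_bin <- cut(quality, breaks = 2:9)
print(table(quality_bin))

# Alternative (assumption, not from the report): an ordered factor keeps
# the plain rating as the label, which can be easier to read on an axis.
quality_ord <- factor(quality, levels = 3:9, ordered = TRUE)
print(levels(quality_ord))
```

Either form gives one group per rating; only the labels differ.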
Verifying the correlation between the variables in the dataset we can identify strong correlation among certain pairs (both in agreement and disagreement). Let’s plot the variable combinations and perform a correlation test, looking at stronger correlation coefficients. They include the following.
Correlation Coefficient Agreement: total sulfur dioxide vs. free sulfur dioxide; fixed acidity vs. density; residual sugar vs. total sulfur dioxide; chlorides vs. density; quality vs. alcohol; density vs. residual sugar; density vs. free sulfur dioxide; density vs. total sulfur dioxide.
Correlation Coefficient Disagreement: residual sugar vs. alcohol; alcohol vs. density.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and total.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
From the above results we see a correlation coefficient for the free sulfur dioxide, total sulfur dioxide pair as 0.615501, and a 95 percent confidence interval of 0.5977994 - 0.6326026.
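The t statistic and confidence interval in this output follow directly from the correlation coefficient and the degrees of freedom. As a sketch, the standard formulas for a Pearson test (a t statistic with n - 2 degrees of freedom and a Fisher z interval) reproduce the printed values; the variable names are chosen here for illustration.

```r
# Values copied from the cor.test() output above.
r  <- 0.615501
df <- 4896
n  <- df + 2   # number of paired observations

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat("t =", round(t_stat, 3), "\n")
cat("95% CI:", round(ci, 4), "\n")
```

With nearly 4900 samples the interval is narrow, so even the weaker correlations reported below are statistically distinguishable from zero; whether they are practically meaningful is a separate question.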
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and density
## t = 19.256, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2391013 0.2911738
## sample estimates:
## cor
## 0.265331
From these results we see a weaker correlation coefficient for the fixed acidity and density pair at 0.265331, with a 95 percent confidence interval of 0.2391013 - 0.2911738.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 8 rows containing non-finite values (stat_smooth).
## Warning: Removed 8 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: residual.sugar and total.sulfur.dioxide
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3776791 0.4246712
## sample estimates:
## cor
## 0.4014393
From these results we see a moderate correlation coefficient for the residual sugar and total sulfur dioxide pair at 0.4014393, with a 95 percent confidence interval of 0.3776791 - 0.4246712. I wonder how total sulfur dioxide affects wine quality?
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 62 rows containing non-finite values (stat_smooth).
## Warning: Removed 62 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: chlorides and density
## t = 18.624, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2308679 0.2831779
## sample estimates:
## cor
## 0.2572113
The plot above shows a weak correlation between density and chlorides, with the correlation coefficient at 0.2572113. Chlorides may be more closely associated with wine taste than with any large effect on the density of the wine.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: residual.sugar and density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
Residual sugar appears to have more of an impact on density. This variable pair shows a strong correlation coefficient of 0.8389665, and a 95 percent confidence interval of 0.8304732 - 0.8470698.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and density
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5094349 0.5497297
## sample estimates:
## cor
## 0.5298813
The variable pair for total sulfur dioxide and density appears to show a fairly strong correlation, at 0.5298813.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and density
## t = 21.54, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2684156 0.3195836
## sample estimates:
## cor
## 0.2942104
The variable pair for free sulfur dioxide and density appears to show a weaker correlation, at 0.2942104.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 6 rows containing non-finite values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: residual.sugar and alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
Turning to negative correlations, we looked at the two variable pairs with the strongest disagreement. The variable pair for residual sugar and alcohol has a correlation of -0.4506312, which appears to indicate that as alcohol content increases, residual sugar tends to decrease. I wonder how this factor plays out when compared with quality ratings?
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_smooth).
##
## Pearson's product-moment correlation
##
## data: alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
This variable pair, alcohol and density, has a correlation coefficient of -0.7801376, indicating that density decreases as alcohol content increases. To help visualize this relationship better we created a geom_line plot in addition to the geom_point plot.
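To compare the pairs side by side, the Pearson coefficients reported in the test outputs above can be collected and ranked by absolute strength. This is a sketch; the pair labels and the squared-correlation column are additions made here for illustration.

```r
# Pearson coefficients copied from the cor.test() outputs above.
pearson_r <- c(
  "free.sulfur.dioxide ~ total.sulfur.dioxide" = 0.615501,
  "fixed.acidity ~ density"                    = 0.265331,
  "residual.sugar ~ total.sulfur.dioxide"      = 0.4014393,
  "chlorides ~ density"                        = 0.2572113,
  "residual.sugar ~ density"                   = 0.8389665,
  "total.sulfur.dioxide ~ density"             = 0.5298813,
  "free.sulfur.dioxide ~ density"              = 0.2942104,
  "residual.sugar ~ alcohol"                   = -0.4506312,
  "alcohol ~ density"                          = -0.7801376
)

# Rank by absolute value so strong negative pairs are not pushed to the end.
ranked <- pearson_r[order(abs(pearson_r), decreasing = TRUE)]

# r^2 gives the share of variance a straight-line fit would account for.
r_squared <- round(ranked^2, 3)

print(data.frame(r = round(ranked, 3), r_squared = r_squared))
```

Ranking by absolute value puts residual sugar vs. density and alcohol vs. density clearly ahead of the other pairs, which is why density features so often in the lists above.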
Next let’s survey quality box plots for the variables associated with wine balance.
## w_w$quality: (2,3]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## w_w$quality: (3,4]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## w_w$quality: (4,5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## w_w$quality: (5,6]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## w_w$quality: (6,7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## w_w$quality: (7,8]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## w_w$quality: (8,9]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
As we can see from the results, there is a generally positive trend: from a rating of 5 upward, mean alcohol content increases along with the quality of the wine. The mean alcohol content for wine with a quality rating of 9 is 12.18 percent.
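As a quick consistency check on the grouped summaries, the group means can be recombined using the group sizes from the frequency table earlier in the report. The sketch below should land close to the overall mean of 10.51 reported by `summary()`, and it also shows where the trend starts; the variable names are chosen here.

```r
# Group sizes (frequency table) and mean alcohol per rating (output above).
rating       <- 3:9
n_wines      <- c(20, 163, 1457, 2198, 880, 175, 5)
mean_alcohol <- c(10.35, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18)

# Size-weighted average of the group means approximates the overall mean.
overall_mean <- weighted.mean(mean_alcohol, w = n_wines)
cat("recombined mean:", round(overall_mean, 2), "\n")

# Rating with the lowest mean alcohol content.
lowest_rating <- rating[which.min(mean_alcohol)]
cat("lowest mean alcohol at rating:", lowest_rating, "\n")

# Change in mean alcohol from each rating to the next.
step_change <- diff(mean_alcohol)
print(setNames(round(step_change, 3), paste(rating[-7], rating[-1], sep = "->")))
```

The steps are negative from 3 to 5 and positive afterwards, so the upward trend is really a statement about ratings 5 through 9, where almost all of the samples sit.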
## w_w$quality: (2,3]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2100 0.2575 0.3450 0.3360 0.3850 0.4700
## --------------------------------------------------------
## w_w$quality: (3,4]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1900 0.2900 0.3042 0.4000 0.8800
## --------------------------------------------------------
## w_w$quality: (4,5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3377 0.4100 1.0000
## --------------------------------------------------------
## w_w$quality: (5,6]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
## --------------------------------------------------------
## w_w$quality: (6,7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3256 0.3600 0.7400
## --------------------------------------------------------
## w_w$quality: (7,8]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.2800 0.3200 0.3265 0.3600 0.7400
## --------------------------------------------------------
## w_w$quality: (8,9]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 0.340 0.360 0.386 0.450 0.490
This box plot explores quality against citric acid. The results indicate that citric acid content remains relatively level across the various wine quality ratings. There appear to be more outliers among wines with a quality rating of 6.
## w_w$quality: (2,3]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1700 0.2375 0.2600 0.3332 0.4125 0.6400
## --------------------------------------------------------
## w_w$quality: (3,4]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1100 0.2700 0.3200 0.3812 0.4600 1.1000
## --------------------------------------------------------
## w_w$quality: (4,5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.240 0.280 0.302 0.340 0.905
## --------------------------------------------------------
## w_w$quality: (5,6]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## w_w$quality: (6,7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
## --------------------------------------------------------
## w_w$quality: (7,8]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.2000 0.2600 0.2774 0.3300 0.6600
## --------------------------------------------------------
## w_w$quality: (8,9]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.240 0.260 0.270 0.298 0.360 0.360
As the results indicate, mean volatile acidity is highest among wines with a quality rating of 4.
## w_w$quality: (2,3]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## w_w$quality: (3,4]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## w_w$quality: (4,5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## w_w$quality: (5,6]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## w_w$quality: (6,7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## w_w$quality: (7,8]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## w_w$quality: (8,9]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
In this box plot for quality and chlorides we see a generally flat trend, with consistent chloride levels across the lower quality ratings. There is, however, a slight dip in chlorides at the 7, 8, and 9 quality ratings. The mean chloride content for a quality rating of 9 is 0.0274.
## w_w$quality: (2,3]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.587 4.600 6.393 10.700 16.200
## --------------------------------------------------------
## w_w$quality: (3,4]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## w_w$quality: (4,5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## w_w$quality: (5,6]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## w_w$quality: (6,7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## w_w$quality: (7,8]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## w_w$quality: (8,9]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
The results for quality and residual sugar appear varied. An interesting point worth noting here is the mean residual sugar content for quality ratings 4 (4.628) and 9 (4.12): both ratings have similar levels of residual sugar. I’m not quite sure why this would be the case; perhaps this variable is affected more by personal preference than any other variable?
With this survey we explored the quality rating (dependent) against the other variables affecting wine balance (independent), and the five independent variables yield some interesting insights. The scatter plots with correlation coefficients allowed us to examine patterns among the independent variables, while the box plots showed how these variables stack up against the dependent quality variable. This will help us later when we attempt to explore various profiles of multiple variables.
One relationship worth noting is alcohol and quality. As alcohol increases, quality ratings appear to increase generally. Another interesting observation is that the means for citric acid and volatile acidity appear to remain flat as quality ratings improve. Also, chloride means tend to decrease slightly as quality ratings increase. Means for residual sugar fluctuate among quality ratings.
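These observations can be roughly quantified from the group means printed in the summaries above. The sketch below computes a Spearman rank correlation between the rating and each variable's group mean. This is a coarse, group-level measure added here for illustration: it ignores group sizes, and the extreme ratings (20 wines at rating 3, 5 wines at rating 9) are noisy, so it should not be read as a per-wine correlation.

```r
# Mean of each balance variable per quality rating, copied from the
# grouped summaries above.
group_means <- data.frame(
  rating           = 3:9,
  alcohol          = c(10.35, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18),
  citric.acid      = c(0.3360, 0.3042, 0.3377, 0.338, 0.3256, 0.3265, 0.386),
  volatile.acidity = c(0.3332, 0.3812, 0.302, 0.2606, 0.2628, 0.2774, 0.298),
  chlorides        = c(0.05430, 0.0501, 0.05155, 0.04522, 0.03819, 0.03831, 0.0274),
  residual.sugar   = c(6.393, 4.628, 7.335, 6.442, 5.186, 5.671, 4.12)
)

# Rank correlation between rating and each variable's group mean.
trend <- sapply(group_means[-1], function(m) {
  cor(group_means$rating, m, method = "spearman")
})

print(round(sort(trend, decreasing = TRUE), 2))
```

Alcohol comes out strongly positive and chlorides strongly negative, in line with the text; the remaining variables fall in between.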
Some stronger correlation coefficients exist between density and residual sugar (0.839), total sulfur dioxide and free sulfur dioxide (0.616), and alcohol and density (-0.780), per the Pearson tests above. These stood out to me and will be used to further understand our data. I was expecting that free sulfur dioxide content would vary depending on its relationship with other flavor variables, but it appears that the amount of free sulfur dioxide may be more related to wine volume.
The strongest correlation, as mentioned above, exists between density and residual sugar, which suggests that as residual sugar content increases so does density. However, we will continue in the multivariate plots section with our exploration of how balance is expressed among quality ratings. These efforts will help fill in additional understanding of distinct relationships.
From the multivariate plot above, when we look at alcohol content against residual sugar we’ll see that quality ratings generally trend with higher alcohol and lower residual sugar content. Alternately lower quality ratings show a general profile of more sugar and less alcohol.
Considering this combination of variables, we can say that white wines with higher residual sugar tend to have lower quality ratings, while white wines with high alcohol content tend to have higher quality ratings.
From the multivariate plot above, when we look at residual sugar content against density we’ll see that quality ratings generally trend with a lower residual sugar content and lower density. Considering this combination of variables, we can say that white wines with higher residual sugar content tend to have lower quality ratings, while white wines with a higher density also tend to have lower quality ratings.
Along with the trend in alcohol percent content related with quality rating identified earlier in the multivariate section (as alcohol content increases, quality rating also tends to increase), we can see from the results above that volatile acidity doesn’t have a strong correlation with quality in our dataset. This leads me to think that volatile acidity is controlled across all quality ratings. I wonder, if there are wines outside this controlled range of volatile acidity would they be used for vinegar?
## `geom_smooth()` using method = 'gam'
Based on the results of the multivariate plot above, when we look at residual sugar content with citric acid we can see that higher quality ratings generally trend within a certain citric acid content and dispersed across a range of residual sugar content. Lower quality ratings for citric acid are dispersed over a wider range. Considering these results for this combination of variables, we can say that white wines rated with a higher quality have a tighter range of citric acid content compared with wines of lower quality rating.
A trend can be interpreted from the above results that, outside of a few outliers, higher quality wine tends to have a lower chloride content, while the inverse tends to hold up for lower quality wines.
To examine the preservation of wine flavor and oxidation control, we looked at sulphates and free sulfur dioxide together. From the above results, sulphate content is dispersed across all quality ratings. Higher quality wines tend to have higher free sulfur dioxide content, although the distribution for free sulfur dioxide varies considerably.
The multivariate correlation plots used in this section help us understand in more detail the trends (or lack of trends) among multiple combinations of variables. Specifically, by coloring the plots by quality, we were able to distinguish the impact multiple variables have in determining wine quality.
From the resulting plots in this section we can understand that higher quality wines generally feature higher alcohol and free sulfur dioxide content, and lower chloride and residual sugar content. Citric acid for higher quality white wines exists in a relatively strict concentration range. Residual sugar has a strong correlation with density, which again should be relatively low. As residual sugar increases so does density. When we see residual sugar compared with quality ratings, however, there tends to be some variation but an overall trend of lower residual sugar content for higher quality ratings. These results help us attain a fairly decent profile of how the variables we explored impact quality ratings.
I wasn’t expecting the results attained from the free sulfur dioxide, sulphates, and quality multivariate analysis. I found it intriguing yet sensible that only a certain amount of free sulfur dioxide was needed to control wine flavor degradation. I was assuming that there might be some correlation between quality and the need for additional free sulfur dioxide, but it appears that there is some consistency in the amount of free sulfur dioxide required to maintain flavor.
## w_w$quality: (2,3]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## w_w$quality: (3,4]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## w_w$quality: (4,5]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## w_w$quality: (5,6]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## w_w$quality: (6,7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## w_w$quality: (7,8]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## w_w$quality: (8,9]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
The box plot above shows trends among alcohol content compared with quality ratings. This provides a jumping off point in the exploration of understanding balance in white wine taste.
Citric acid and residual sugar seem like they would have the most impact on perceived flavor for most people. This multivariate plot explores the variable pair, filled in by quality rating.
Alcohol and chlorides also have an effect on the balance of a wine. This multivariate analysis explores both variables, filled in by quality rating, to further explore flavor profiles.
Feeling out and uncovering insights of this dataset via exploratory data analysis is a test of patience, determination and objectivity. In the effort to weigh independent variables against the dependent variable for this white wine dataset we were able to uncover trends among variables related to balance and flavor to better understand how wine sommeliers rate wines.
One area I ran into difficulties with was in the correlation matrix plot. I found that I wasn’t quite successful in ensuring all the components of the matrix were sized right and that the labels were easy to read. Initially I also wanted to practice application of plot colors in the univariate section, but ultimately found the colors to be distracting and resorted to a more simplistic color palette.
Without knowing a great deal of detail regarding the wine quality ratings process I found it difficult to understand the ratings. Through this analysis we were able to uncover the qualities of variables for wine balance (residual sugar, chlorides, volatile acidity, citric acid, and alcohol). In weighing several variables against each other in the multivariate plots, we were able to successfully determine that higher quality wines generally feature higher alcohol and free sulfur dioxide content, and lower chloride and residual sugar content. Citric acid for higher quality white wines exists in a relatively strict concentration range, while residual sugar has a strong correlation with density, which again should be relatively low. And when residual sugar is compared with quality ratings there tends to be some variation among results but an overall trend of lower residual sugar content for higher quality ratings.
Considering the insights gleaned from this exercise, it would be possible to further enrich this analysis by using additional data, for example by comparing related data for red wine with this data for white wine. This could be used to identify similarities and differences between the two types of wine, while also understanding the balance of flavor for red wine samples. Future efforts to improve this analysis should include the development of a wine quality model to aid in the prediction of wine quality, based on the flavor variables we explored in this analysis. We could use this additional analysis to buy and sample our own collection of wines.